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A METHOD FOR GENE MAPPING FROM GENOTYPE AND PHENOTYPE 
DATA 

Field of the invention 

The present invention relates to a method for gene mapping from genotype and 
phenotype data, which method utilizes- linkage disequilibrium between genetic 
markers m/, which are polymorphic nucleic acid or protein sequences or strings of 

single-nucleotide polymorphisms deriving from a chromosomal region. 
Background of the invention 

The use of linkage disequilibrium (LD) in detecting disease genes has recently 
drawn much attention in genetic epidemiology. LD is evaluated with association 
analysis, which, when applied to disease-gene mapping, requires the comparison of 
allele or haplotype frequencies between the affected and the control individuals, 
under the assumption that a reasonable proportion of disease-associated chromo- 
somes has been derived from a common ancestor. Traditional association analysis 
methods have long been used to test the involvement of candidate genes in diseases 
and, in special circumstances, to fine-map disease loci found by linkage methods. 
The testing has mostly been done using simple two-point measures. 

Improved statistical methods to detect LD have been presented lately (Terwilliger 
1995; Devlin et al. 1996; Lazzeroni 1998; McPeek and Strahs 1999; Service et al. 
1999). The newer methods are based on statistical models of LD around a disease 
susceptibility (DS) gene. Genomic regions - rather than alleles - that are shared 
among affected individuals, are searched for. The recombination history from the 
common ancestor to the present day is taken into account with more or less simpli- 
fied statistical models. The power of these methods, as well as their ability to local- 
ize the correct position of the DS gene, has been shown to be better than that of tra- 
ditional methods. Some of the models are robust against high levels of etiologic 
heterogeneity (McPeek and Strahs 1999; Service et al. 1999). However, the meth- 
ods contain assumptions about the inheritance model of the disease and the structure 
of the survey population, and the effects of violations of these assumptions in the 
real data are not known. In addition, they can only consider association of one re- 
gion at a time. Thus, they are currently best suited for fine mapping rather than 
complex disease mapping or genome screening. The methods also tend to be com- 
putationally heavy. 
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The present inventors have recently introduced a so-called haplotype pattern mining 
(HPM) method (Toivonen et al. 2000a and 2000b). In the HPM method, haplotype 
patterns are ordered by their strength of association with the phenotype, and all hap- 
lotype patterns exceeding a given threshold level are used for prediction of disease 
susceptibility gene location. The advantage of the HPM method is that it is model- 
free as it does not require any assumptions about the inheritance model of the dis- 
ease The haplotype patterns are allowed to contain gaps and therefore the HPM 
method is quite robust against mutations and to missing and erroneous data. How- 
ever the basis of the HPM method is that haplotypes, i.e. separate vectors of alleles 
of markers, are available. As will be explained below, this requirement causes vari- 
ous problems in gene mapping methods, and thus also in the HPM method. 
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Zhang et al (2002) have extended the HPM method to allow simultaneous use of 
haplotype data of related individuals with quantitative trait from an extended pedi- 
gree This is done by employing the Quantitative Pedigree Disequilibrium Test 
(QPDT) statistic to measure the strength of association between haplotype and a 
quantitative trait. 

The standard procedure in association-based gene mapping is to 1) ascertain indi- 
viduals carrying the trait of interest and their family members (atleast parents), 2) 
genotype the individuals, 3) derive the haplotypes computationally using genotypes 
within families, and finally to 4) find associations in the haplotypes (gene mapping). 

Even though the actual association analysis is done on sole case and control haplo- 
types, obtaining these haplotypes requires the parents of the affected individuals to 
be genotyped as well: vast majority of haplotyping programs available expect the 
parental genotypes to exist. This means that the parents first have to be recruited, 
which is not always straightforward, as they might no longer be alive, or cannot be 
reached, or refuse from giving blood samples. Genotyping more individuals is labo- 
rious and elevates the study expenses: per every case or control, 3 individuals will 
be genotyped instead of just one, so genotyping is done on 3 times as many persons 
as there are cases and controls, m case the non-transmitted parental chromosomes 
could be used as controls, a case and his/her parents contributes ' one case-control 
pair, in which case the genotyping effort is 1.5 times higher than the number of ca- 
ses and controls needed. 

As an alternative to these haplotyping approaches, some methods for direct haplo- 
typing from population-based data have indeed been presented, but the problem 
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with these is that they still produce a lot of mistakes, which is a very bad starting 
point for any haplotype based association program. 

There is no straightforward way to use genotypes as input for a method that is de- 
signed for haplotypes. Given a genotype, it is in principle possible to consider all 
haplotype configurations available from the genotype, and to run a haplotype gene, 
mapping method on the different configurations of chromosomes. In practice, how- 
ever, this is not feasible for marker maps of reasonable size due to a combmatona 
explosion: given a genotype with N heterozygous markers, the number of afferent 
possible haplotype configurations is *» (or 1, if N = 0). For instance, for N=100 
the number of possible haplotype configurations is about 6*10 . 
Zhang and Zhao (2002) have studied linkage disequilibrium mapping directly with 
genotype data. Their approach is model-based and the method is based on the decay 
of haplotype sharing (DHS) method for haplotype data developed by McPeek and 
Strahs (1999). The approach of Zhang and Zhao is based on explicitly considering 
all possible haplotype configurations. Since this is not feasible for marker maps of 
interesting sizes - as was described above - they apply complex and error-prone 
techniques to prune the number of haplotype configurations they need to consider. 
Further, in this method, data consisting of multiallele (microsatelhte) loci only - 
thus no SNPs (single nucleotide polymorphisms) or any other type of markers - is 
considered. In short, the method works as follows: it is supposed that there are two 
alleles in the disease locus: the disease causing allele D and the normal allele d. The 
basic idea is to treat the chromosomes in affected individuals as if they were a ran- 
dom sample from chromosome population consisting of both chromosomes with the 
D allele and d allele. Chromosomes in normal individuals are assumed to be a ran- 
dom sample of chromosome population only consisting of d chromosomes. Next a 
likelihood for individual haplotypes is formulated in the same way as presented by 
McPeek and Strahs (1999) where the probability of an observed haplotype is de- 
pendent of number of generations from original disease mutation, recombination 
rates between markers, and mutation rate in marker loci. The gap between using 
, haplotype data as the starting point (as in McPeek and Strahs) and genotype data (as 
Zhang^nd Zhao present) as the starting point is bridged with following inference 
for each genotype gi , there are several haplotype pairs that are compile w«h .* 
(2 N -' where N is the number of heterozygote sites in the genotype). The likelihood 
for an observed genotype is the sum of probabilities of each possible haplotype pa*, 
5 where the probabilities of individual haplotypes are formulated as above. The ge- 
netic parameters of interest (such as location of the disease locus, mutation rate and 
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disease allele age) are then estimated by using EM algorithm. The large number of 
possible ancestral haplotypes prerequisites pruning out any too rare haplotypes; the 
haplotype frequencies are estimated with Markov model, and any which are below 
some prespecified level are left unconsidered. 

5 The approach of Zhang and Zhao has the following serious drawbacks. First, the 
principle of Zhang and Zhao is to explicitly consider all possible haplotype configu- 
rations. This is feasible only with very small marker maps. Second, to avoid the first 
problem and to extend the applicability of the approach to larger maps, Zhang and 
Zhao apply additional pruning techniques to reduce the number of haplotype con- 

10 figurations they need to consider. However, those techniques are complex and error- 
prone Third, their approach is based on summing probabilities of different haplo- 
type configurations. Such an approach is not directly applicable to pattern-based 
mapping methods such as HPM. 

Curtis et al (2001) studied the use of an artificial neural network to detect associa- 
15 tion between disease and multiple marker genotypes. The pattern-recognition prop- 
erties of the network were used in the hope that marker haplotypes implicit in the 
genotypes differed between cases and controls in a way which led to the network 
being able to classify the subjects correctly, according to their marker genotype. 

Summary of the invention 

The object of the present invention is to provide a model^free and computationally 
effective method allowing direct association analysis on genotype rather than haplo- 
type data, which overcomes the above-mentioned drawbacks. The invention offers 
remarkable advantages by avoiding the technically difficult, costly and sometimes 
impossible steps of recruiting and genotyping family members, as well as by avoid- 
ing some of the error sources present in population-based haplotyping methods. 

The above-mentioned object is achieved in accordance with the invention by the 
method for gene mapping from genotype and phenotype data, which utilizes linkage 
T^ZiL between genetic "markers m U which are polymorphic nucleic acid or 
protein sequences or strings of single-nucleotide polymorphisms deriving from a 
chromosomal region. The method according to the invention is characterized by the 
following steps: 

i) all marker patterns P that satisfy a pattern evaluation function e(P) are 
searched from the data, wherein 



20 



25 



30 



WO 03/085585 



PCT/FI03/00248 



15 



a the marker patterns are expressions involving the marker-allele as- 
signments and zero or more of the following: individual covanates, 
environmental variables and auxiliary phenotypes; and 

b the pattern evaluation function e(P) involves some statistical measure 
5 ' of the association between the marker pattern P and the phenotype be- 

ing studied, 

by testing each marker of pattern P against the corresponding allele pan- 
in genotype G, effectively finding out if there is a possible haplotype 
configuration of G which matches P and counting the possible matches 
10 as matches, 

ii) each marker m/ of the data is scored by a marker score s(mi), which is a 
function of the set S t defined as the set of marker patterns overlapping 
the marker m f and satisfying the pattern evaluation function e as de- 
fined in step (i), and 

iii) the location of the gene is predicted as a function of the scores s(mO of 
all the markers m; in the data and is based on maximizing the score if 
the scoring function is designed to give higher scores closer to the gene, 
and on minimizing the score if the scoring function is designed to give 
lower scores closer to the gene, as is the case for instance when the 
scores s(m0 are marker-wise p values. A computer-readable data stor- 
age medium according to the invention has computer-executable pro- 
gram code stored thereon, said executable program code being opera- 
tive to perform a method of any embodiments of the invention when 
executed on a computer. 

A computer system according to the invention is programmed to perform the 
method of any embodiments of the invention. 

As used herein the term 'haplotype' defines a vector of alleles in a single chromo- 
some. Also, as used herein the term 'genotype' defines a vector of (unphased) allele 
pairs in a chromosome pair. 

The term •microsatellite' used defines a small run (usually less than 0.1 kb] > of 'tan- 
dem repeats of a very simple DNA sequence, usually 1-4 bp, for example (CA)n It 
has been used as the primary tool for genetic mapping during the 1990s. Multialle- 
lic genetic locus' is a gene with high level of variation; there are several types of 
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variants in the gene loons, each with reasonably high frequency. 'SUP. single an- 
Z£ ^.ymorphism, defines any poiymorphic variation a, a single ^otitic. 
Al*ongh lerinfonnadve than microsatellites, SNPs are more amenable to large- 
scale automated scoring. 
5 Brief description of the drawings 

Figure 1 shows the localization accuracy of HPM-G compared to HPM: the » « 
shows which fraction of simulated data sets is in the predicted region, the length of 
which is given on the x axis. 

Figure 2 shows the effect of sample size on localization accuracy with a) HPM-G 
10 (sample size in people) and b) HPM (sample size in chromosomes). 

Figure 3 shows the effect of missing data (5%, 10%) on localization accuracy with 
1 1 HPM-G 1(150 affected and 150 control individuals) and b) HPM (200 disease- 
associated and 200 control chromosomes). 

Figure 4 shows the effect of 100 permutations on localization accuracy. 
15 DetaUed description of the invention 

The object of the present invention is to provide a method for gene mapping from 

may also be a combination of several phenotypes. 

The method according to the invention, also named as haplotype pattern minmgin 
aenotype data (HPM-G), uses data mining methods in LD-based gene mappmg. The 
mlTuS Tooth genomes and haplotypes as input. In dishes with reasonable 
—on, affected individuals are likely to have higher »es o 
Lociated marker alleles n« the DS gene man control mdrvrdu ah 
of marker alleles which are more frequent in genotypes of affected mdmduals than 
n loupes of unaffected individuals, are searched for in me data, w.thou, aasump- 
uTZt me mode of inheritance of the disease. These comb.nat.ons, marker pat- 
Z or haplotype pauems, are sorted by the strength of their assocanon wdh 
oTeLe and Ute resuWng Us. of marker or haplotype patterns is used m locahzmg 
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the DS gene. Terms marker pattern and haplotype pattern denote the same concept, 
and are used interchangeably in this text. 

The method according to the present invention is an algorithm-based extension of 
traditional association analysis. It works with a non-parametric statistical model and 

5 without any genetic models. The localization power of the method of the invention 
is high even in cases, where multiple independent founder mutations are allowed, 
and the frequency of the most common mutation in the affected chromosomes var- 
ies between 5-15% at realistic sample sizes (100 affected individuals and a similar 
number of population controls). In addition, the experiments suggest that the 

10 method is highly robust against missing data. Since HPM-G can handle high de- 
grees of etiologic heterogeneity, it can be successful in complex disease mapping. 

LD the non-random association of marker alleles and haplotypes to the disease, is 
likely to be strongest around the DS gene; consequently the locus is likely to be 
where most of the strongest associations are. In the HPM-G method according to 

15 the invention, we search for shared, flexible haplotypes that may contain gaps, and 
find out, which ones are strongly associated with the disease status. We then use a 
non-parametric model for predicting the DS locus, on the basis of the locations of 
the haplotypes. Permutation tests can be used to contrast the results against the null 
hypothesis that there is no gene effect. 

20 Marker or Haplotype Patterns and Disease Association 

We examine linkage disequilibrium by looking for marker or haplotype patterns that 
consist of a set of nearby markers, not necessarily consecutive ones. Given a marker 
map M with k markers m h ...,m h a "marker pattern" or "haplotype pattern P on M 
is defined as a vector (p;,..,p*), where each Pi is either an allele of marker m f or the 
"don't care" symbol (*). The haplotype pattern P occurs in a given haplotype vector 
Xo^sL^hjlM) IfpUi or Pi=* ^ all U<=*=*. Pattern P occurs in a 
given genotype G=({g„, g„}, .... {gk.*a» * or K=«2 °r Pi- ** a11 

i,l<=i<=k. 

For example, consider a marker map of 10 markers. The vector Pi = (*, 2, 5, *, 3, *, 
****), where 1, 2, 3,... are marker alleles, is an example of a haplotype pattern. 
This pattern occurs, for instance, in a chromosome with haplotype (4, 2, 5, 1, 3 2 6, 
4 5 3) The pattern also occurs in the genotype ({2,5}, {2,3}, {1,5}, {4,6}, {3,6}, 
{2,4}, {1,2}, {1,4}, {3,5}, {1,6}). (For instance, {2,5} is the genotype of marker 
1; the alleles are 2 and 5, but their phases are not known.) 
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Our goal is to search for haplotype patterns that roughly correspond to haplotypes 
identical by descent in the disease-associated. In doing this, there are two major is- 
sues with respect to the shapes of haplotype patterns: the genetic length of the sig- 
nificant part of the patterns, and gaps. We define the "(genetic) length" of a haplo- 

5 type pattern P^jLpk) as the maximum distance, in Morgans, between any two 
markers m^mj with Pi * * *Pf Searching for haplotype patterns of arbitrary length 
hardly makes sense; it is unlikely that genetically extremely long patterns will be 
discovered, at least not in significant numbers. Consequently, when haplotype pat- 
terns are searched for, the maximum length of patterns to be considered can be con- 

10 strained with an optional pattern-search parameter to the HPM-G method. 

We allow for gaps in the marker patterns, since mutations, gene conversions, errors, 
missing data, and recombinations can corrupt continuous haplotypes. Marker muta- 
tions and errors typically cause very short gaps only. Missing informauon can span 
several consecutive markers, depending on the data collection scheme. Longer gaps 
15 can be introduced by double recombinations which, however, are rare on genetically 
short distances. In the HPM-G method, the maximum number and maximum length 
of gaps can be controlled with pattern search parameters. 

Mining Disease-Associated Haplotype Patterns 

All marker patterns P that satisfy a pattern evaluation function e(P) are searched 
20 from the data in the step (i) of the method of the invention. The pattern evaluation 
function e(P) involves some statistical measure of the association between the 
mSer pattern P and the phenotype being studied. In step (n). each marker m f of the 
data is scored by a marker score s( mi ), which is a function of the set S t defined as 
the set of marker patterns overlapping the marker m; and satisfying the pattern 
25 evaluation function e as defined in step (i). 

In step (i), let U be the universe of marker patterns considered in the study. The 
output S of this phase is the set of those marker patterns that satisfy e, i.e., S - (P .e 
U\e(P)istrue). 

In step (ii), for each marker m t in the data, let S t = (P e V \ P makes a reference to 
30 mi or to a marker to the left of and to a marker to the right ofmO be the set of pat- 
terns in 5 that overlap with the marker m, . In this phase each marker m is scored as 
a function of Sj_ and the result is s(mi). 

In step (iii), the location of the gene is predicted as a function of the scores t(m0 of 
all the markers m t in the data. This function returns an area where the gene is likely 
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to be. The area can be contiguous or fragmented, and it can be a point in a special 
case. 

The marker or haplotype patterns P can be searched for by an algorithm developed 
by the inventors for this purpose or by the levelwise search method described in 
5 Mannila and Toivonen (1997). Preferred algorithms are given in the following. 

Version 1 of the algorithm for rp arlrer pattern searching 

The following algorithm is a simple, generic, and efficient way to implement step 
(i) of the method according to the invention. It is based on depth-first search in the 
space of patterns, a standard procedure in computer science. A pre-requisite is that 
10 there is a suitable generalization relation for the patterns, such that if a pattern satis- 
fies the evaluation function, then all more general patterns also satisfy it. 

Input 

• set U of possible marker patterns 

• evaluation function e(P) for patterns PinU 
15 • (generalization) relation < for patterns in U 

• where the function e and the relation < are such that if e(P) is true and P' < P, 
then e(P*) is also true 

t 

Output 

• setS= {Pe U\ e(P) is true} of patterns 

20 Method 

1. S: = {) 

2. II Initialize the set of evaluated patterns: 

3. E:={) 

4. // Start with the most general patterns: 

25 5. Gen := {P in U \ there is no P' in U, P' != P, such that P' < P) 

6. // Recursively evaluate patterns in a depth first order: 

7. foreach P e Gen { evaluatePatterns(P) } 

8. end; 

9. procedure evaluatePatterns(P) { 
30 10. insert P into the set E 

11. if e(P) = true then { 

12. insert P into set S 

13. // Find all specializations of P that have not been tested yet, and 

14. // evaluate them recursively: 
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Spec := [P' in U-E | P < P\ P' != P> and there is no P" in U-E, P" != P 
12 and P" != P', with P<P"< P'i: 

17. foreach P' in Spec { evaluatePatterns(P'); } 

18. } 
5 19.} 

VprQinn 7. nf the algor ithm for marker pattern searching 

The following algorithm is a simple, generic, and efficient way to implement step 
(i) of the method according to the invention. It is based on depth-first search in the 
space of patterns, a standard procedure in computer science. It approximates the 
10 exact answer by ignoring infrequent and therefore statistically less important pat- 
terns. 

Define an auxiliary evaluation function ae(P) which is true if and only if the fre- 
quency of pattern P exceeds a given threshold x, (how to specify the threshold is 
discussed elsewhere) and replace the original evaluation function e(P) by function 
15 e'(P) defined as e'(P) = true if and only if e(P) is true and ae(P) is true. This re- 
finement prunes patterns that satisfy e but are no more frequent than x. Such infre- 
quent patterns are statistically not relevant, and therefore little information is lost 
when they are ignored. Now a suitable generalization relation is obtained from logi- 
cal implication based on the pattern syntax: P < P' if and only if P 1 -> P- 

20 The algorithm uses the generalization relation based on logical implication to struc- 
ture the search space, and the auxiliary function ae to prune the search space. All 
patterns satisfying ae are searched for, but only those also satisfying e are output. 

Input 1 

• set U of possible marker patterns 

25 • evaluation function e(P) for patterns P in U 

• frequency threshold x 

T^ttS = [P in U | e{P) and ae(P) is true} of patterns, where ae(.P) is true if and 
only if the frequency of pattern P exceeds a given threshold x 

30 Method 
20.S:= {} 

21 . // Initialize the set of evaluated patterns: 

22. £:=U 
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23.11 Start with the most general patterns: 

24. Gen := {P in U \ there is no P' in U, P' != P, such that P->P'} 
25.11 Recursively evaluate patterns in a depth-first order: 
26.foreach P in Gen { evaluatePatterns(P) } 

5 27. end 

28. procedure evaluatePatterns(P) { 

29. insert P into the set E 

30. if ae(P) = true then { 

31 if e(P) = true then insert P into set 5 

10 32. // Find all specializations of P that have not been tested yet, and evaluate 

33. //them recursively: 

34 Spec := {P' in U-E \ P' -> P, P' != P, and there is no P" in U-E, P" !- P 

35* andP"\=P', with P' -> P" and P" -> P } 

36. foreach P' in Spec { evaluatePatterns(P ") } 

15 37. } 
38.} 

Version 3 of t he, al porithm for marker pattern searching 

When phenotype being studied is qualitative and the pattern evaluation function 
e(P) has the form e(P) = true if and only ife'(P) > x, where e'(P) is the (signed) as- 

20 sociation measure X 2 and x is a user-specified minimum value, which is chosen so 
that the sizes of S> are large enough, such as 7, to give statistically sufficiently reli- 
able estimates for the gene locus, the following algorithm is a simple, generic, and 
efficient way to implement step (i) of the method according to the invention. It is 
based on depth-first search in the syntactic space of patterns. It derives a lower 

25 bound lb for pattern frequency from the given lower bound x for chi-squared test, 
and uses lb to prune the search. 

Input 

• marker map M = (m/, ... 

• phenotype vector Y = (Yj, .... Y n ) 

30 • genotype matrix H of size n * k * 2 (n persons, k markers, 2 alleles per person 
and marker) 

• association threshold x for chi-squared test 

• maximum pattern length / 

• maximum number of gaps g 
35 • maximum gap size s 
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Output 

• set S = { P in U \ e(P) is true } of patterns, 

• where U consists of patterns on M that consist of marker-allele assignments and 
that adhere to parameters /, g, and i, and 

5 • where e(P) is true if and only if chi-squared test on P using genotype matrix H 
and phenotypes Y exceeds the given threshold x 

Method 
39.5 : = {} 

40. // Number of case and control persons: 
10 41. pi a •= number of affected persons ; 

42. pic '■= number of control persons; 

43. pi := piA + piC 

44. // A lower bound for pattern frequency: 
45. lb. = piA *pi *x/(pic*pi + P i A **) 

1 5 46. // Variable for iterating over different patterns: 

41. P = (pi P0 :=('*'. -.'*') 

48. for i:= ltok { 

49. // alleles(mi) is the set of alleles of the i:th marker 

50. foreach a in alleles(mj) { 
20 51. Pi-=a 

52. // Test pattern P and all its extensions: 

53. checkPatterns(P, i, i, 0, 0) 

54. // Reset pf. 

55. Pi-'*' 
25 56. } 

57. } 

58. end 

59. // Test haplotype pattern P and all patterns that can be generated by extending P 

60. //from the right: 

30 61. procedure checkPatterns(P, start, i, nr_of_gaps, gapjength) { 

62. // Output strongly associated patterns 

63. if chi-squared(P, M, H,Y)>=x and Pi != '*' then insert P into set S 

64. // Return if extended patterns would be too long: 

65. if i = k or i+1 -start > I then return 

35 66. // Return if extended patterns can not be strongly disease-associated: 

67. if frequency of P in disease-associated persons is less than lb 

68. then return; 



PCT/FI03/00248 

WO 03/085585 

P 

13 

// Create and test legal extensions of current pattern P (3 cases): 
// 1 . Give marker i+i all possible values: 
foreach a in alleles(m; + ;) { 

Pi+1 ■ *= a 

checkPatterns (P, start, i+1, nr_of_gaps, 0) 

II 2. Introduce a new gap starting at marker i+i: 
if pi*'*' and nr_of_gaps < g and s>l then { 

Pi+1 : = * , 
checkPatterns (P, start, i+i, nr_of_gaps+l, 1) 

(1 3. Extend the current gap over marker i+i: 
if pj = '*' and gapjength < s then { 

Pi+i- 

checkPatterns (P, start, i+1, nr_of_gaps, gap_length+l) 

) 

II Before returning, reset pi+j: 

Pi+1 ■= 
return 



VCTsion 4 of th » al porithm fo r marker pattern searching 

The following algorithm is a simple, generic, and efficient way to implement step 
(i) of the method according to the invention. It is based on the levelwise search 
method described in Mannila and Toivonen (1997). 

25 Input 

• set U of possible marker patterns 

• evaluation function e(P) for patterns P in U 

. (generalization) relation < for patterns in U, where the function e and the rela- 
tion < are such that if e{P) is true and P' < P, then «(/") is also true 

30 Output 

• set S = {P in U | e(P) is true} of patterns 
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. function Lgg: U -> 2 U , Lgg(P) = { P' in 1/ \P > P' and P' != P and there is no 
P" mU such that P != P" != P'andP> P" > P'), the set of least general gen- 
eralizations of pattern P. 

. function Lss:U->2 u ,Lss(.P)={P' in U \ P < P' and P' != P and there is no 
5 P" in U such that P != P" \= P' and P < P" < P'), the set of least special spe- 
cializations of pattern P. 

Method 
89.5: = {} 
90.Q: = [) 

10 91 . // Start with the most general patterns: 

92. F: = {PinU\ there is no P' in U, P' != P, such that P'<P); 

93. while F\={} { 

94. // Evaluate the candidate patterns: 

95. foreachPinF { 

15 96. if e(P) = true then insert P into set 5 

97. else remove P from set F | 

98. } 

99. Q: = QunionF 

1 00. // Generate a new set of candidate patterns: 

20 101. C:={} 

102. foreachPinF { 

103. C : = C union { P' in U \ P' in Lss(P) and for allP" m Lgg(P ): 

104. P"inQ} 

105. } 

25 106. F: = C 

107. } 

108. end 

Vprsinn 5 of fe e al gorithm for marker pattern searching 
This is the levelwise search version of the algorithm 2. 

30 Input 

• set U of possible marker patterns 

• evaluation function e{P) for patterns P in U 

• frequency threshold* 
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T^ctS ={PmU\ e(P) and ae(P) is true) of patterns, where ae(P) is true if and 
only if the frequency of pattern P exceeds a given threshold x 

Definitions 

5 . function L g8 : U -> 2", Lg*(P) -{P'itiU \ P->P' and P != P and there lS no 
p" m U such that P != P" != J" ™rf "> "> set of leaSt geneTal 

generalizations of pattern P. 
. function L«: i/ -> 2 U , Lss(P) = {P'inU | P' -> P ami P' != P aru* toe u no 
p" in U such that P != P" != P' anrf P' -> P" -> P>. the set of least special 

10 speciaUzations of pattern P. 

Method 

109. S: = {} 

110. Q:={} 

111. // Start with the most general patterns: 

15 1 12. F:= {P\nU\ there is no P' in U, P' != P such that P -> P' }; 

113. while F!= {} { 

1 14. // Evaluate the candidate patterns: 

115. foreachPinP { 

116. if ae(P) = frue then { 
1 17 if e (P) = true then insert P into set 5 

118. } 

1 19. else remove P from set F 

120. } 

121. £ : = g union F 
25 122. // Generate a new set of candidate patterns: 

123. C: = {} 

124. foreachPinF { 

125 . C : = C union { P' in U | P' in Lw(P) and for all P" in Lgg(P ): 
126*. P"inG) 

30 127. } 

128. F: = C 

129. } 

130. end 



20 
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The phenotype being studied may be qualitative, for example disease is present or 
disease is absent In that case, the pattern evaluation function e(P) may have the 
form e(P) = true if and only if e'(P) > x, where e'(P) is the (signed) association 
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measure % 2 and x is a user-specified minimum value, which is chosen so that the 
sizes of 5, are large enough, such as 7, to give statistically sufficiently reliable esti- 
mates for the gene locus and the score s(m0 of marker is the size of S, also 
called marker-wise pattern frequency of m/ and denoted by f(mi). 

5 As mentioned above, the (signed) 5c 2 is a measure of marker-disease association. A 
signed version of the measure is used in order to discriminate disease association 
from control association. The signed X 2 measure ±£(F) of a haplotype pattern P is 
positive if P is more frequent in cases than in controls, and negative otherwise. 
Given a "(positive) association threshold" x, we say that P is "strongly associated" 
10 with the disease if ±£(P)>x- 

The first part of the HPM-G method can be described as follows. Given the data — 
markers M, genotypes H, and phenotypes Y - the task is to output all haplotype 
patterns P that are strongly associated with the disease status for a given value of 
the association threshold x. We denote the collection of all such haplotype patterns 
15 by s - that is, S = {P is a haplotype pattern on M | ± % 2 (P)>x} . If pattern parame- 
ters are specified - a maximum genetic length, a maximum number of gaps, or a 
maximum length for gaps - the task is refined by requiring that these additional 
restrictions are also fulfilled. 

The signed % 2 value is calculated from a 2x2 contingency table, where the rows cor- 
20 respond to the trait-association statuses of the persons, and the columns correspond 
to the presence and absence of the haplotype pattern. A pattern P=(pi....,Pk) » Pre- 
sent in a given genotype G=({g n , gal. (fti-gul) if Pi=8il or Pi - 8 i2 or Pi -* for 
all i \<=i<=k. If, instead of a genotype, two haplotype vectors H,=(hu,...,hlk) and 

H 2 =( h21 h2k ) are given for a person, pattern P is considered to be present in the 

25 person if it is present in either of the haplotypes, i.e., if either pi=hn or Pi - * for all 
i,\<=i<=k, or Pi =h2iorpi=*foralli,l<=i<=k. 

The value of % 2 statistic is computed normally, and a negative sign is attached, if the 
relative frequency of the haplotype pattern among the control persons is higher than 
among the trait-associated persons. 

30 The first observation in solving the pattern-mining task is that given an association 
threshold x, a lower bound can be derived for the frequency of strongly associated 
haplotype patterns as follows: 
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Given a 2x2 contingency table of the numbers of disease-associated (A) and control 
(Q persons either matching a pattern (P) or not (N), the % 2 test statistic for the dis- 
ease association of the pattern is defined by 

fo-,„ -n r „ -n M -n^-n 
n A -n c -n P -n H 

5 where n i} is the number of persons with properties i and j, %i the number of persons 
with property i, and n the total number of persons. Given the number of affected 
jS^oS? the number of control persons (n C ), and a lower bound x for the test 
statistic we can derive a lower bound for the pattern frequency among the affected 
persons l AP ) as follows. Assuming the pattern is disease-associated, we have 

0 *AP ■ *CN > *AN ' *CP- The test statistic is maximized when n C P = ^ ™plying 
n A p = np and hqn = 71 C- Tnen 

n A -n c -n e -n N n A -n c -n AP -{n-n P ) n A \n-n AP ) 

and 



n A \n-n AP Y n c -n^n 



A X 



15 The situation is symmetric for protective haplotypes, and the lower bound for n C P 
is obtained by simply swapping n A and n C in the above result. If disease-associated 
and protective haplotypes are searched for at the same time, the smaller of n AP and 
n CP can be used as a lower bound for n P , making the implementation slightly sim- 

pier. 

20 On another hand, given such a frequency threshold, all patterns exceeding the 
threshold can be enumerated efficiently with data-mining algorithms or a standard 
depth-first search method. An algorithm that first finds all haplotype patterns whose 
frequency exceeds the computed lower bound and then evaluates the association 
measure on them, is guaranteed to find the exact set of strongly disease-associated 

25 patterns. 

The approach is suitable for finding protective haplotype patterns by considering 
patterns P with ±% 2 (P) < -x. The derivation of the lower bound for the frequency 
among controls is identical to the case above. Obviously, both disease-associated 
and protective haplotypes can be found when \±£(P)\ > x. 



10 
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In another embodiment according to the invention, the phenotype being studied may 
be in addition to qualitative, also quantitative, for example a measured blood con- 
centration of a substance has a certain value. In that case, the pattern evaluation 
function e(P) may have the form e(P) = true if and only ife'(P) > x, where e'(P) is 
; the absolute frequency of pattern P in the data and x is a user-specified value, which 
is chosen so that the sizes of 5, are large enough, such as 20, to give statistically suf- 
ficiently reliable estimates for the gene locus. In this embodiment, the statistical 
strength of the method may be still increased. 

A linear model is of form Y = Mi + - + A** +<&+&>< where the dependent 
variable Y is a quantitative phenotype, X, through X, are covariates, such as envi- 
ronmental factors, and Z is a dummy variable for the occurrence of the haplotype 
pattern Firstly, the coefficients and P* are adjusted for best fit. Secondly, the sig- 
nificance of Z as a covariate is assessed by using a t test. If the phenotype is di- 
chotomous, then the logit transformation can be applied. 

15 Marker scoring in the case of qualitative phenotype being studied 

Haplotype patterns close to the DS locus are likely to have stronger association than 
haplotypes further away; consequently the locus is likely to be where most of the 
stroneest associations are. In one embodiment according to the invention the 
SrscoTS is defined as the marker frequency /(m,) of marker « (with re- 
spect to M,H,Y^c) as the number of patterns that contain marker m;, possibly mi m a 
gap: Mi) = \{P =(Pl,.;Pk) 6 S | there exist t < i and u > i such that p t * * * Pu )\- 
The idea is that each haplotype pattern roughly corresponds to a continuous chro- 
mosomal region, potentially identical by descent, where gaps allow for corruption 
of marker data. While markers within gaps are not used in measuring the disease 
association of the pattern, the whole chromosomal region of the pattern is thought to 
be relevant. 

Marker scoring in the case of qualitative or quantitative phenotype being studied 
In order to derive the score the p value (statistical significance) of each 

marker pattern P in determining the phenotype being studied is evalua ted and I the 
Zns(mO is the distance between the observed p value distribution of patterns in S, 

and the uniform distribution, defined as average of (p t - qd log (£ Iqd J« ^ j» " 
I n where n is the number of haplotype patterns in S it Pi is the ith smallest p value 
in S h and * is the expectation of the ith smallest p value, if the p values were ran- 
domly drawn from the uniform distribution. 
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Gene localization 

The location of the gene, predicted as a function of the scores s(mO and based on 
maximizing or minimizing the score, is predicted to 

- the location of the marker m/ that maximizes or minimizes the marker score s(mi), 



or 
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- the combination of most probable intervals for containing the trait-susceptibihty 
locus that covers at most the desired proportion t (IE {0,100%}) of the original re- 
gion obtained by taking all such points in the studied chromosomal region whose 
nearest marker is within the k best scoring markers, where k is selected such that the 
resulting area has length at most t times the length of the studied region, and where 
k is maximal such value, or 

- those points in the studied chromosomal region whose nearest marker scores at 
least y or at most y, where y is scoring function dependent and is selected so that the 
probability of the gene being close to the marker is sufficiently large. 

The location of the gene may also be determined by expert investigation of the 
marker scores or their visualization e.g; as a curve. 

Permutation Tests 

More information about the significance of the observed scores may be obtained by 
permutation tests. The results obtained by considering the marker frequencies or the 
linear model, as explained earlier, can be contrasted against the null hypothesis that 
all the persons are drawn from the same distribution; that is, there is no gene effect 
in the disease status. We propose to permute randomly the status fields of the per- 
sons keeping the proportions of affected and control persons constant, in a fashion 
similar to the method of Churchill and Doerge (1994). We approximate marker- 
wise p values using permutations and then predict the DS gene to be in the vicinity 
of the marker with the smallest empirical p value. Consecutive markers are depend- 
ent and thus a large number of mutually dependent p values are produced. This is 
not a problem, since we do not use the p values for hypothesis testing, but only for 
ranking markers. 

Marker-wise p values are used to re-score markers by their statistical unexpected- 
ness The test is carried out as follows: The phenotypes of the persons are randomly 
shuffled a number (thousands) of times. The scores are re-calculated for each per- 



• 



PCT/FI03/00248 

WO 03/085585 

20 

mutation in turn. Marker-wise p value p(m0 is the proportion of such permutation 
scores for marker m f that are larger than or equal to the non-permuted score. 

Each score s(m0 is then refined by replacing it by the marker-wise p value p(m0 of 
the score s(m0. 



5 Searching several genes 

Several genes may be searched for simultaneously by using marker patterns that 
refer to several potential gene loci at the same time. 

Examples 

Certain embodiments and results of the present invention are described in the fol- 
io lowing non-limiting examples. 

Example 1 - Simulated Data Sets 

We evaluated the performance of the proposed HPM-G method with simulated data 
sets that correspond to a recently founded, relatively isolated founder ^popula- 
tion Simulation of a population isolate was chosen, since it is recommended as the 
15 study population for LD studies. However, the method can be applied to any popu- 
lation that is suitable for LD analysis, since no assumptions are made about the 
population structure. 

An isolated founder population, which grows from the initial size of 200 to 100,000 
individuals in 20 generations, was simulated. 
20 The population pedigree was first generated assuming distinct generations and ex- 
ponential growth of the population size. In each generation, the parents of the new- 
born individuals were randomly selected from members of the previous generation, 
with the exception that whenever a parent with at least one child was chosen, his/her 
spouse was always forced to become the other parent of the child. Tins procedure 
25 generates family structure into each generation. 

In the simulation of inheritance, each member of the first generation was assigned 
to have one pair of homologous chromosomes. The genetic length of the chromo- 
somes was 100 cM for both males and females. Meiosis was repeatedly simulated, 
and in each meiosis the number of crossover points was taken from a Poisson distn- 
30 bution with parameter value 1, which corresponds to the total genetic length of the 
chromosome. No chiasm interference was modeled. To accommodate the fact that 
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ever-increasing informativeness of marker maps may soon facilitate whole-genome 
LD mapping, we used relatively dense and informative marker maps with inter- 
marker intervals of exactly one cM. Each marker contained 4 alleles, whose fre- 
quencies in the founder population were 0.4 for one allele and 0.2 for the remaining 
5 three alleles. The polymorphism information content (PIC) of each marker was thus 
fixed at 0.678. 

To produce each of the 100 data sets for HPM-G and HPM, the processes of disease 
locus selection, diagnosing and sampling were done independently. Next, these 
processes are described. 

10 For each data set, a random locus was selected as a disease locus, and 8 random 
chromosomes present in the original population were labeled as disease-carrying 
chromosomes. In the final population, all chromosomes that had inherited the dis- 
ease locus identical-by-descent from one of the eight founders were considered to 
carry a disease-causing mutation. 
15 In the diagnosing phase, we used a liability-based model, where an individual's 
probability of becoming affected depends on two factors: the presence of a disease 
mutation and a normally distributed random component. The random component is 
thought to contain factors such as environmental effects and effects of other un- 
known genes. The liability of an individual is defined as L = 5x 1+ * 2 + C, where 
20 indicator variable x, indicates the presence of any of the disease-causing mutations, 
and variable x 2 is randomly sampled from standard normal distribution. Based on 
. the generated segment data, the value of constant C is set in such a way that the de- 
sired population prevalence of 5 per cent is reached. The liability value L defines 
the probability of an individual to be affected, denoted by p, via formula 
= L . Having discovered the affection status of each individual, the desired 
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number of individuals labeled as affected was randomly selected to constitute the 
affected sample. 

The control samples were created using two different methods: for HPM-G that util- 
izes genotype data, the control individuals were simply taken randomly from the 
entire population. To do this, the sampling process described above was repeated, 
but this time the liability of each individual was purely random, including no ge- 
netic component. 
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For the original HPM that requires haplotype data, we used a more laborious sam- 
pling method: the genotypes of the parents of the affected individuals were col- 
lected to create family-based pseudocontrol chromosomes. This was done in prac- 
tice by taking the alleles in the non-transmitted chromosomal segments of the par- 
5 ents of each affected individual and labeling them as control chromosomes. In real- 
ity, this is a common practice. In the simulations we treated the haplotypes obtained 
from the simulator as given, which corresponds to error-free haplotyping, and is 
expected to slightly favor HPM in the comparisons. 

The simulation of missing data was based on the notion that in real genotyping 
10 laboratories there seems to be two types of clustering of missing data. First, missing 
genotypes tend to cluster to certain individuals, which can be a consequence of low 
quality samples. Second, certain markers may function poorly, likely producing 
missing genotypes. To mimic such clustering of missing data, we defined two pa- 
rameters: parameter a corresponds to the amount of missing data that clusters to 
15 individuals and parameter 0 to the amount that clusters to markers. The missing 
genotypes were selected using the following procedure: 

For each individual £, a personal missing genotype probability was computed 
as the x value of the first random point in (x, y) plane (x, ye [0,1]) that satisfies the 
inequality y<lle m . Having computed the value of variable x\ for the individual, 

20 each of his/her genotypes was then labeled as missing with probability x\ . In the 
second phase, the procedure was repeated for each marker. For each marker j, a 
marker failure probability x» was computed in an analogous fashion as the x value 
of the first random point in (*, y) plane (*, ye [0,1]) that satisfies the inequality 
y < i/ e * , and each genotype corresponding to that marker was labeled as missing 

25 independently for each individual with probability xf . 

Values of variables a and £ were empirically adjusted to produce the desired over- 
all levels of missing data. These values were: 25 and 80 for 5%, and 13 and 40 for 
10% of missing data. 

Example 2 - Comparison to HPM 

30 The localization accuracy was explored by plotting curves similar to power graphs: 
the height of the curve shows the fraction of data sets for which the localization was 
successful, as a function of the length of the predicted region. The sample consisted 
of 150 affected and 150 control genotypes. The maximum length of a pattern was 7, 
and one gap of one marker was allowed. The association threshold was set to 10. 
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These numbers were based on experimentation. For comparison, we also show the 
corresponding curve for HPM with 1/3 smaller sample size, and thus equal genotyp- 
ing cost (figure 1). With HPM we used association threshold 9, the parameters for 
the patterns were the same than those used with HPM-G. 

5 The results show that HPM-G has a high accuracy, and that it is extremely competi- 
tive even in comparison to state-of-the-art methods that use explicitly haplotyped 
data. 

Example 3 - effect of sample size 

The effect of sample size was examined by experimenting with sample sizes of 
10 100+100, 150+150, 200+200 and 300+300 people (figure 2a). Figure 2b shows the 
corresponding results for HPM. 

HPM-G performs well even with only 100+100 genotypes. On the other hand, if 
the amount of data is increased, the accuracy is improved. 

Example 4 - effect of missing data 

15 The influence of missing data was explored by randomly removing 5% or 10% of 
marker genotypes (figure 3a). Figure 3b shows the corresponding results for HPM. 

These results show that HPM-G is very robust against missing data. 

Example 5 - Localization Accuracy with Permutation Tests 

Permutation tests were used to obtain more information about the significance of 
20 observed marker frequencies. Marker-wise P values were used to sort markers by 
their statistical unexpectedness, not to test the statistical significance of the findings. 
We performed the following experiment in order to see if the prediction accuracy 
can be improved by permutation tests. We predicted the location of the DS geneto 
be at the marker with the smallest P value instead of the most frequent marker. The 
25 localization accuracy with 100 permutations compared to that without permutations 
is shown in figure 4. The curves are almost identical, which is due to the evenly dis- 
tributed and identically informative markers. 

The situation could be different with real marker data, where permutation tests are 
likely to bring a greater benefit. 
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Claims 

1 A method for gene mapping from genotype and phenotype data, which method 
utilizes linkage disequilibrium between genetic markers m it which are polymorphic 
nucleic acid or protein sequences or strings of single-nucleotide polymorphisms 
5 deriving from a chromosomal region, characterized in that 

i) all marker patterns P that satisfy a pattern evaluation function e(P) are 
searched from the data, wherein 

a the marker patterns are expressions involving the marker-allele assign- 
ments and zero or more of the following: individual covariates, environ- 
10 mental variables and auxiliary phenotypes; and 

b the pattern evaluation function e(P) involves some statistical measure of 
the association between the marker pattern P and the phenotype being 
studied, 

by testing each marker of pattern P against the corresponding allele pair in 
15 genotype G, effectively finding out if there is a possible haplotype configura- 

tion of G which matches P and counting the possible matches as matches, 

ii) each marker ^ of the data is scored by a marker score s(m0, which is a func- 
tion of the set S, defined as the set of marker patterns overlapping the marker 
mi and satisfying the pattern evaluation function e as defined in step (i), and 

20 Hi) the location of the gene is predicted as a function of the scores s(m0 of all the 
markers m f in the data and is based on maximizing the score if the sconng 
function is designed to give higher scores closer to the gene, and on minimiz- 
ing the score if the scoring function is designed to give lower scores closer to 
the gene, as is the case for instance when the scores s(m0 are marker-wise p 

25 values 

2. A method according to claim 1, characterized in that a marker is scored as 
the sum of the weights of overlapping patterns. 

3. A method according to claim 2, characterized in that the weight of a pattern is 
a function of 
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- the uncertainty of matching, e.g. 2 1Nt,J , where N[i] is the number of het- 
erozygous markers within the pattern in genotype i, summed over all 
matched genotypes, or 

- the informativeness of the pattern, e.g. 2 H , where H is the average heterozy- 
5 gosity within the pattern, or 

- the strength of association, e.g. chi-squared. 

4. A method of claim 1, characterized in that the marker patterns P are searched 
for by the following algorithm: 

Input 

10 • set U of possible marker patterns 

• evaluation function e(P) for patterns P in U 

• (generalization) relation < for patterns in U 

• where the function e and the relation < are such that if e(P) is true and P' < P, 
then eif) is also true 

15 Output 

• set S = [P e U | e{P) is true) of patterns 

Method 
1- S. = {) 

2. //Initialize the set of evaluated patterns: 
20 3. E := { } 

4. // Start with the most general patterns: 

5. Gen := {P in U \ there is no P' in U, P' != P, such that P' < P) 

6. // Recursively evaluate patterns in a depth first order: 

7. foreach P 6 Gen { evaluatePatterns(P) } 
25 8. end; 

9. procedure evaluatePatterns(P) { 

10. insert P into the set E 

11. if e(P) = true then { 



12. insert P into set S 

30 13. // Find all specializations of P that have not been tested yet, and 

14. // evaluate them recursively: 

15. Spec := {P' in U-E \ P < P', P' != P, and there is no P" in U-E, P" != P 

1 6. and P" != P\ with P < P" < P 'J; 

17. foreach P' in Spec { evaluatePatterns(P'); } 



m 
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18. 

19.} 



} 



5. A method of claim 1, characterized in that the marker patterns P are searched 
for by the following algorithm: 

5 Input 

• set U of possible marker patterns 

• evaluation function e(P) for patterns PinU 

• frequency threshold x 

Output 

10 • set 5 = [P in U \ e(P) and ae(P) is true} of patterns, where ae(P) is true if and 
only if the frequency of pattern P exceeds a given threshold x 



20.S: = {} 

21.// Initialize the set of evaluated patterns: 
15 22.E:={} 

23. // Start with the most general patterns: 

24. Gen := [P in U \ there is no P' in U, P' != P, such that P -> P' } 

25. // Recursively evaluate patterns in a depth-first order: 

26. foreach P in Gen { evaluatePatternsCP) } 
20 27. end 

28. procedure evaluatePatterns(P) { 

29. insert P into the set E 

30. if ae(P) = true then { 

31. if e(P) = true then insert P into set 5 

25 32. // Find all specializations of P that have not been tested yet, and evaluate 

33. // them recursively: 

34. Spec := [P' in U-E \ P' -> P, P' != P, and there is no P" in U-E, P"\=P 

35. and P" != P', with P' -> P" and P" -> P } 

36. foreach/" in Spec { evaluatePatternsCP') } 



Method 



30 



37. 
38. } 



6. A method of claim 1, characterized in that the marker patterns P are searched 
for by the following algorithm: 
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Input 

• marker map M = (mj, ... ,m#) 

• phenotype vector Y = (Y], .... Y n ) 

• genotype matrix H of size n * k * 2 (n persons, k markers, 2 alleles per person 
5 and marker) 

• association threshold x for chi-squared test 

• maximum pattern length I 

• maximum number of gaps g 

• maximum gap size s 

10 Output 

• set S = .{P in U \ e(P) is true) of patterns, 

• where U consists of patterns on M that consist of marker-allele assignments and 
that adhere to parameters /, g, and i, and 

• where e(P) is true if and only if chi-squared test on P using genotype matrix H 
1 5 and phenotypes Y exceeds the given threshold x 

Method 
39.S: = {} 

40. // Number of case and control persons: 

41. piA := number of affected persons; 
20 42. pic := number of control persons; 

43. pi:=piA+piC 

44. // A lower bound for pattern frequency: 

45. lb := piA * pi *x/{piQ *pi+ piA * *) 

46. // Variable for iterating over different patterns: 
25 47.P = (p; Pk) ••=('*'...• . ""> 

48. for i := 1 to k { 

49. // alleles(m,-) is the set of alleles of the i:th marker 

50. foreach a in alleles(mj) { 

51. pf.-a 

30 52. // Test pattern P and all its extensions: 

53. checkPatterns(P, i, i, 0, 0) 

54. //Reset /?;: 

55. Pi := '*' 

56. } 
35 57. } 

58. end 



WO 03/085585 



30 



PCT/ET03/00248 

I 



59. // Test haplotype pattern P and all patterns that can be generated by extending P 

60. //from the right: 

6 1. procedure checkPatternsCP, start, i, nr_of_gaps, gapjength) { 

62. // Output strongly associated patterns 

5 63. if chi-squared(P, M, H,Y)>=x and pi != '*' then insert P into set S 

64. // Return if extended patterns would be too long: 

65. if i = k or i+1 -start > / then return 

66. // Return if extended patterns can not be strongly disease-associated: 

67. if frequency of P in affected persons is less than lb 
10 68. then return; 

69. // Create and test legal extensions of current pattern P (3 cases): 

70. // 1. Give marker i+1 all possible values: 

71. foreach a in alleles(mj + 7) { 

72. p i+ j :=a 

15 73.checkPatterns (P, start, i+1, nr_of_gaps, 0) 

74. } 

75. // 2. Introduce a new gap starting at marker i+1: 

76. if pi *'*' and nr_of_gaps <gands>l then { 
ll.pi+j := '*' 

20 78.checkPatterns (P, start, i+1, nr_of_gaps+l, 1) 

79. } 

80. // 3. Extend the current gap over marker i+1: 

81. if pi = '*' and gapjength < s then { 

S2.p i+1 := '*' 

25 83. checkPatterns (P, start, i+1, nr_of_gaps, gap_length+l) 

84. } 

85 . // Before returning, reset pi+] : 

86. p I+7 := '*' 

87. return 
30 88. } 

7 . A method of claim 1 , characterized in that the marker patterns P are searched 
for by the following algorithm: 

Input 

• set U of possible marker patterns 
35 • evaluation function e(P) for patterns P in U 



• 
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• (generalization) relation < for patterns in U, where the function e and the rela- 
tion < are such that if e(P) is true and P' < P, then <?(/") is also true 

Output 

• set S = [P in U \ e(P) is true) of patterns 
5 Definitions 

• function Lgg: U -> 2 U , Lgg(P) = { P' in U \ P>P' and P' != P and there is no 
P"'mU such that P \-P" l=P' and P > P" > P'}, the set of least general gen- 
eralizations of pattern P. 

• function Lss: U -> 2 U , Lss(P) = { P' in U | P < P' and P' != P and there is no 
10 P" in U such that P\=P" != P' and P < P" < P'}, the set of least special spe- 
cializations of pattern P. 

Method 
89.5: = {} 
90.Q: = {} 

15 91 .// Start with the most general patterns: 

92. F : = {P in U \ there is no P' in U, P' != P, such that P' < P}; 

93. while F != {} { 

94. // Evaluate the candidate patterns: 

95. foreachPinF { 

20 96. if e{P) = true then insert P into set S 

97. else remove P from set F 

98. } 

99. Q : = Q union F 

100. // Generate a new set of candidate patterns: 
25 101. C: = {} 

102. foreachPinF { 

103. C : = C union { P' in U \ P' in Lss(P) and for all P" in Lgg(P'): 

104. P"inQ) 

105. } 

30 106. F: = C 

107. } 

108. end 



8. A method of claim 1 , characterized in that the marker patterns P are searched 
for by the following algorithm: 
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Input 

• set U of possible marker patterns 

• evaluation function e(P) for patterns P in U 

• frequency threshold x 

5 Output 

• set S = {P in U \ e(P) and ae(P) is true) of patterns, where ae(P) is true if and 
only if the frequency of pattern P exceeds a given threshold x 

Definitions 

• function Lgg: U -> 2 U , Lgg{P) ={P'mU | P -> P' and P' != P and there is no 
10 P" in U such that P != P" != P' and P -> P" ->P'}, the set of least general 

generalizations of pattern P. 

• function Lss: U -> 2 U , Lss(P) = { P' in U \P' -> P and P' != P and there is no 
P" in U such that P != P" != P' and P' -> P" -> P), the set of least special 
specializations of pattern P. 

15 Method 

109. S : = { } 

110. Q: = {) 

111. // Start with the most general patterns: 

1 12. F := {Pint/ | there is no P' in U, P' != P, such that P -> P' }; 
20 113. while F!= {} { 

1 14. // Evaluate the candidate patterns: 

115. foreach.PinF { 

116. ifae(P) = frwethen { 

117. if e(P) = true then insert Pinto set S 
25 118. } 

1 19. else remove P from set F 

120. } 

121. Q: = 2unionP 

122. // Generate a new set of candidate patterns: 
30 123. C : = { } 

124. foreachPinF { 

125. C : = C union { P' in U \ P' in Lss(P) and for allP" in Lgg(P'): 



126. 

127. } 

128. F 

129. } 



P"ing } 



35 



: = C 



WO 03/085585 



PCT/FI03/00248 



33 

130. end 

9. A method of claim 1, characterized in that 

a) the phenotype being studied is qualitative, and 

b) the pattern evaluation function e(P) has the form e(P) = true if and only if 
5 e'(P) > x, where e'(P) is the (signed) association measure % 2 and x is a user- 
specified minimum value, which is chosen so that the sizes of 5,- are large 
enough, such as 7, to give statistically sufficiently reliable estimates for the 
gene locus, and 

c) the score s(mi) of marker m,- is the size of 5„ also called marker-wise pat- 
10 tern frequency of m; and denoted by f(mj). 

10. A method of claim 1, characterized in that 

a) the pattern evaluation function e(P) has the form e(P) = true if and only if 
e'(P) > x, where e'(P) is the absolute frequency of pattern P in the data and 
x is a user-specified value, which is chosen so that the sizes of 5, are large 

15 enough, such as 20, to give statistically sufficiently reliable estimates for 

the gene locus, and, 

b) in order to derive the score s(mi), the p value (statistical significance) of 
each marker pattern P in determining the phenotype being studied is evalu- 
ated, and 

20 c) the score s(m0 is the distance between the observed p value distribution of 

patterns in 5, and the uniform distribution, defined as average of (p, - qi) 
log (pi I qi) over all i = l..n, where n is the number of haplotype patterns in 
Si, pi is the ith smallest p value in and q t is the expectation of the ith 

smallest p value, if the p values were randomly drawn from the uniform 
25 distribution. 

11. A method of claim 10, characterized in that the p value is computed using a 
linear model of form Y = /3\X { + . . . + f3 k X k + oZ + where the dependent variable 
Y is the phenotype being studied, Xi through X k are covariates, such as environ- 
mental factors, and Z is a dummy variable for the occurrence of the haplotype pat- 

30 tern, and 



the coefficients or and /% are adjusted for best fit, and then 



m 
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the significance of Z as a covariate is assessed by using a t test with the null hy- 
pothesis "a= 0". 

12. A method of claim 1, characterized in that each score s(m0 is refined by re- 
placing it by the marker-wise p value of the score s(nii), where the statistical signifi- 

5 cance of s(mi) is measured against the null hypotheses that there is no gene effect. 

13. A method of claim 12, characterized in that the marker- wise p values p(m0 
are determined by randomly permuting phenotypes. ' 

14. A method of claim 1, characterized in that the area returned from the predic- 
tion of the gene location is contiguous or fragmented or a point. 

10 15. A method of claim 1, characterized in that the location of the gene, predicted 
as a function of the scores s(m0 and based on maximizing or minimizing the score, 
is predicted to the location of the marker m/ that maximizes or minimizes the mar- 
ker score s(m0. 

16. A method of claim 1, characterized in that the location of the gene, predicted 
15 as a function of the scores s(nti) and based on maximizing or minimizing the score, 

is predicted to the combination of most probable intervals for containing the trait- 
susceptibility locus that covers at most the desired proportion t (re {0,100%}) of the 
original region obtained by taking all such points in the studied chromosomal region 
whose nearest marker is within the k best scoring markers, where k is selected such 
20 that the resulting area has length at most t times the length of the studied region, and 
where k is maximal such value. 

17. A method of claim 1, characterized in that the location of the gene, predicted 
as a function of the scores s(m0 and based on maximizing or minimizing the score, 

is predicted to those points in the studied chromosomal region whose nearest 
25 marker scores at least y or at most y, where y is scoring function dependent and is 
selected so that the probability of the gene being close to the marker is sufficiently 
large. 

18. A method of claim 1, characterized in that the location of the gene, predicted 
as a function of the scores s(m0 and based on maximizing or minimizing the score, 

30 is determined by expert investigation of the marker scores or their visualization. 
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19. A method of claim 1, characterized in that several genes are searched for si- 
multaneously by using marker patterns that refer to several potential gene loci at the 
same time. 

20. A computer-readable data storage medium having computer-executable pro- 
gram code stored, characterized in that it is operative to perform a method of any 
of the preceding claims when executed on a computer. 

21. A computer system, characterized in that it is programmed to perform the 
method of any of the claims 1 to 19. 
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HPM-G: effect of sample size 
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HPM-G: effect of missing data 
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